Abstract

The new deployments of the German Federal Armed Forces make it necessary to analyze large
quantities of Human Intelligence (HUMINT) reports. These reports are characterized by a large
topical and linguistic variety. Therefore, they are good candidates for applying techniques from
computational linguistics. This paper describes the ZENON project, in which an
information extraction approach is used for the (partial) content analysis of English HUMINT
reports from the KFOR (Kosovo Force) deployment of the Bundeswehr. The overall objective of
this research is to realize a graphically navigable Entity-Action-Network. The information
about the actions and named entities is identified in each sentence, and the content of the
sentences is formally represented in typed feature structures. These structures can be combined
and presented in a navigable network. After a short introduction, the information extraction
approach is explained and the ZENON project is described in detail. English HUMINT reports from
the KFOR deployment form the basis for the development of the experimental ZENON system.
These reports are used to build a specialized text micro-corpus with semantic annotations. This
KFOR text corpus is described as well.

1.  Introduction 

On the one hand, the new deployments of the German Federal Armed Forces make it necessary to
analyze large quantities of HUMINT reports. These reports are characterized by a large topical
and linguistic variety. For that reason, they are good candidates for applying techniques from
computational linguistics to analyze natural languages. On the other hand, the processing of
human language has been identified as a critical capability in many future military applications (cf.
[Steeneken, 1996]). The content analysis of free-form texts is especially important for any
information operation of the Network Centric Warfare (NCW) concept (see [NCW, 2001], p. 5-15).
We set up the research project ZENON1, in which an information extraction approach is
used for the (partial) content analysis of English HUMINT reports from the KFOR deployment
of the Bundeswehr. The overall objective of this research is to realize a graphically navigable
Entity-Action-Network. The information about the actions and named entities is identified in
each sentence, and the content of the sentences is formally represented in typed feature
structures. These structures can be combined and presented in the navigable network.

The ZENON project is based on the results of the former SOKRATES2 project, in which
we applied information extraction to the analysis of German free-form battlefield reports (cf.
[Casals, 2004a], [Casals, 2004b], [Frey, 2004], [Hecking, 2001], [Hecking, 2002], [Hecking,
2003a], [Hecking, 2003b], [Hecking, 2004a], [Hecking, 2004b], [Hecking, 2004c], [Schade,
2003a], [Schade, 2003b]). The SOKRATES prototype was able to process written battlefield
reports (e.g., messages about hostile movements, deployments) in German. The reports were
analyzed, represented in feature structures and semantically enhanced with the help of an
ontology. With the SOKRATES prototype we showed the general applicability of
information extraction (IE) technology for military purposes.

1 Named after Zenon of Citium (336 BC - 264 BC), philosopher and founder of Stoicism.

2 Named after Socrates (469 BC - 399 BC), philosopher.

This paper is structured as follows. First, a short introduction to the information extraction
approach is given. Then, the ZENON project is described in detail. English HUMINT reports
from  the  KFOR  deployment  form  the  basis  for  the  development  of  the  experimental  ZENON 
system.  These  reports  are  used  to  build  a  specialized  text  micro-corpus  with  semantic 
annotations.  This  KFOR  text  corpus  is  described  as  well. 

2.  Information  Extraction 

In the last decades, various techniques for processing spoken and written natural language have been
developed (e.g., speech recognizers in dictation systems, machine translation, grammar checking).
IE is an engineering approach (cf. [Appelt, 1999]) to the content analysis of free-form texts based
on results of computational linguistics. Each IE system is tailored to a specific domain and task.
IE uses a shallow syntactic approach (cf. [Hecking, 2003b]), i.e., only parts of the sentences
(so-called 'chunks') are processed with finite-state automata or transducers.

During IE, relevant information about the Who, What, When, etc. in natural language texts is
identified, collected, and normalized (cf. [Pazienza, 1999], [Hecking, 2004a]). The relevant
information is described through patterns called templates. These domain- and task-specific
templates represent the meaning of the relevant information. During the IE task, the templates are
filled with the extracted information. One way to realize the templates is to use typed
feature structures (cf. [Hecking, 2004b]). Therefore, IE can be seen as the process of
normalizing free-form text into a defined semantic structure.

To realize an IE system, language-specific resources (lexicon, grammar) and appropriate
parsing software are necessary.

In order to achieve robust and efficient IE systems, domain knowledge must be integrated and
shallow algorithms must be used. The domain knowledge is tightly integrated with the language
knowledge, e.g., the name 'Leopard' in the lexicon has the categorical information 'tank'. This
association between words and semantic information is domain-specific and has to be changed for
other applications.
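The coupling of lexicon entries and domain categories can be illustrated with a minimal sketch. The dictionary layout and the second lexicon (`ZOO_LEXICON`) are assumptions made for illustration only; the actual ZENON lexicon format is not described here.

```python
# Hypothetical lexica: surface form -> domain-specific category.
# Only the 'Leopard' -> 'tank' entry comes from the text; the second
# lexicon is an invented contrast for a different application domain.
MILITARY_LEXICON = {"leopard": "tank"}
ZOO_LEXICON = {"leopard": "animal"}


def categorize(word, lexicon):
    """Return the category the given lexicon assigns to a word, or None."""
    return lexicon.get(word.lower())


military_category = categorize("Leopard", MILITARY_LEXICON)
zoo_category = categorize("Leopard", ZOO_LEXICON)
```

Swapping the lexicon changes the semantic information attached to the same word, which is why the association has to be rebuilt for each new application.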

The IE process itself is divided into sub-steps. After tokenizing the text, the sentence boundaries
must be identified. Then, the morphological component identifies the word stems and the
abbreviations and detects the syntactic information (e.g., grammatical case and gender). After this,
the chunk parsing with transducers selects the parts of the natural language text that are relevant for
the specific information extraction task. The chunks are then used to instantiate the templates,
which represent the action/event descriptions. They are the result of the IE process.

* Tokenizing
* Sentence splitting
* POS tagging
* Gazetteer
* NE Recognition
* Morphological analysis

* Detect verb phrases
* Extract action types
* Extract sentence content

GATE

Information Extraction Presentation System (IEPS)

Figure 1: The ZENON processing chain

The  IE  is  used  as  the  core  natural  language  processing  technique  in  the  ZENON  project. 
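The first sub-steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the GATE components used in ZENON: the tokenizer is a plain regular expression, the sentence splitter breaks at every '.', '!' or '?' token (so abbreviations such as "Mr." would be split incorrectly), and `GAZETTEER` is a hypothetical one-entry name list.

```python
import re

# Hypothetical gazetteer with a single entry taken from the example report.
GAZETTEER = {"GOSTIVAR": "city"}


def tokenize(text):
    """Split text into word/number tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def annotate(sentence):
    """Pair each token with its gazetteer category (None if not listed)."""
    return [(tok, GAZETTEER.get(tok)) for tok in sentence]


def process(text):
    """Run tokenizing, sentence splitting and gazetteer lookup in order."""
    return [annotate(s) for s in split_sentences(tokenize(text))]


result = process("Four persons were killed in GOSTIVAR. The area is calm.")
```

Each later step (POS tagging, named entity recognition, chunk parsing) would add further annotations to the same token sequence.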

3.  The  ZENON  Project 
Outline 

Starting with English HUMINT reports (and a list of city names) from the KFOR
deployment of the German Federal Armed Forces, we have realized in our ZENON project a
prototype that is able to perform a (partial) content analysis of these reports (cf. [Hecking, 2005a]).
The content of these KFOR reports covers a wide spectrum. Apart from descriptions of
conflicts between ethnic groups, tensions between political parties, information about
infrastructure problems, etc., there are also reports that concern individuals or other entities.
Statements of the form "A meets B", "A marries C", "A shoots B", etc. contain information
about activities/events and the entities involved. This information, completed with location and
time data, is combined into a graphically navigable Entity-Action-Network (e.g., with a
person in the center of the network). Intelligence analysts can use this network to
navigate through the content of the reports.

Since most of the reports are in English, GATE (General Architecture for Text Engineering,
cf. [Cunningham, 2002]) was selected as the toolbox. GATE is an architecture, a free
open source framework (SDK) and a graphical development environment for Natural Language
Engineering. It offers many tools, which are used to realize the natural language processing
parts of the ZENON prototype (e.g., morphological analyzer, part-of-speech (POS) tagger,
pre-defined transducers to recognize English verbal phrases, chunk parsing). The functionality
to select and combine the extracted information from different sentences and different reports
is realized by the Information Extraction Presentation System (IEPS). IEPS is a graphical tool
to visualize information extracted from free-form texts.

Figure 1 shows the ZENON processing chain. HUMINT reports are fed into the first
sub-component. In this component the natural language text is tokenized (i.e., words,
numbers, etc. are found), the sentence boundaries are detected, the part of speech (i.e., whether a
token is a noun, a verb, etc.) is determined, simple names of cities, regions, military organizations, etc.
are annotated (through the Gazetteer), named entities (i.e., complex names of, e.g., political
organizations, persons, etc.) are recognized, and a morphological analysis is done. The
result of this sub-component is the set of annotated sentences of the reports. The second
sub-component uses these annotations to extract the action type (e.g., 'kill'), starting with the verb
of the sentence. Once the action type is determined, the other parts of the sentence (e.g., subject,
object, time expressions) are located and formally represented in typed feature structures.
These structures are coded in XML (Extensible Markup Language) format and represent the
output of the natural language part of the ZENON prototype. In the third sub-component
(IEPS) the extracted content of different reports can be combined and selected according to
predefined XSLT (Extensible Stylesheet Language Transformations) sheets. The result of the
analysis can be navigated interactively.
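How a feature structure might be coded in XML can be sketched as follows. The element names `feature`, `feature-structure` and `atomic` and the `name` attribute are assumptions chosen for illustration; the XML schema actually used by ZENON is not specified in this paper.

```python
import xml.etree.ElementTree as ET


def to_xml(name, value):
    """Convert a nested dict (feature structure) or atomic value to XML."""
    elem = ET.Element("feature", {"name": name})
    if isinstance(value, dict):
        # A nested feature structure becomes a container of features.
        fs = ET.SubElement(elem, "feature-structure")
        for key, sub in value.items():
            fs.append(to_xml(key, sub))
    else:
        # Anything else is stored as an atomic text value.
        atomic = ET.SubElement(elem, "atomic")
        atomic.text = str(value)
    return elem


verb_group = {"type": "vg", "infinitive": "kill", "voice": "passive"}
xml_text = ET.tostring(to_xml("verb", verb_group), encoding="unicode")
```

Because the output is plain XML, a stylesheet processor can later select and recombine features without knowing how they were extracted.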

Extraction  of  Named  Entities 

An important step in the natural language processing is the recognition of the
domain- and application-specific named entities. In the ZENON prototype, transducers for the
recognition of the following named entities were developed: City, Company, Coordinates,
Country, CountryAdj, Currency, Date, GeneralOrg, MilitaryOrg, Number, Percent, Person,
PoliticalOrg, Province, Region, River, Time and Title. An example is shown in Figure 2.


Rule: PersonName1
///////////////////////////////////////////////////////////
// Recognizes: "Mr. Bedredin SHEHU", "Mrs SHEHU"
// Output: TempPerson{title, firstName, lastName,
//         gender, 'PersonName1'}
(
  // Title
  (
    ({PersonTitle}):title
    ({Token.string == "."})?
  )
  // First name
  (
    ({Lookup.majorType == person_first}):firstName
  )?
  // Last name
  ({Token.category == NNP,
    Token.orth == allCaps}):lastName
  (
    {Token.string == "-"}
    {Token.category == NNP}
  )?
):person
-->
{ ... }

Figure 2: Named entity recognizer 'PersonName1'
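The pattern of the rule in Figure 2 can be approximated with a regular expression. This is only a rough sketch: the original rule works on token annotations (POS tags, gazetteer lookups), whereas the version below works on raw characters, and the lists `TITLES` and `FIRST_NAMES` are hypothetical stand-ins for the gazetteer lists, which are not given in the text.

```python
import re

# Hypothetical stand-ins for the PersonTitle and person_first lists.
TITLES = ("Mrs", "Mr")
FIRST_NAMES = ("Bedredin",)

TITLE = "|".join(TITLES)
FIRST = "|".join(FIRST_NAMES)

PERSON = re.compile(
    rf"\b(?P<title>{TITLE})\.?\s+"                        # title, optional "."
    rf"(?:(?P<firstName>{FIRST})\s+)?"                    # optional first name
    rf"(?P<lastName>[A-Z]{{2,}}(?:-[A-Z][A-Za-z]*)?)\b"   # all-caps last name
)


def find_persons(text):
    """Return one dict (title, firstName, lastName) per recognized person."""
    return [m.groupdict() for m in PERSON.finditer(text)]


persons = find_persons("Mr. Bedredin SHEHU met Mrs SHEHU.")
```

As in the rule, the title is mandatory and the first name is optional, so "Mrs SHEHU" yields a structure whose `firstName` is empty.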

Extraction  of  Verb  Phrases,  Action  Types  and  Sentence  Content 

GATE offers various transducers to recognize English verb groups. We have adapted and
extended these transducers to fit our application. In addition to finite and non-finite verbal
phrases, modal verb phrases, participles and special composed verb expressions are also
recognized. Figure 3 shows an example.
Rule: FVGPrePerPasNeg
// Recognizes: Present Perfect Passive Negative:
//             "hasn't been eaten"
// Pattern: (has | have) not been VBN
// Output: VG{adverb, infinitive, neg='yes',
//         tense='PrePer', type='FVG', voice='passive'}
(
  (
    {Token.string == "has"} |
    {Token.string == "have"}
  )
  (NEGATION)
  {Token.string == "been"}
  (ADVS):adverb
  ({Token.category == VBN}):verb
):x
-->
{ ... }

Figure 3: Verb phrase transducer 'FVGPrePerPasNeg'
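The matching logic of the transducer in Figure 3 can be mimicked over a list of (word, POS tag) pairs. The sketch below rests on two assumptions, because the macros NEGATION and ADVS are not defined in this paper: NEGATION is taken to match "not" or "n't", and ADVS is taken to match zero or more adverbs (tag RB). The function returns the participle as found in the text; deriving the infinitive would require the morphological analysis step.

```python
NEGATIONS = {"not", "n't"}  # assumed content of the NEGATION macro


def match_fvg_pre_per_pas_neg(tagged, start=0):
    """Match (has|have) NEGATION been ADVERB* VBN beginning at `start`.

    `tagged` is a list of (word, pos_tag) pairs. Returns a dict describing
    the verb group, or None if the pattern does not match.
    """
    i = start
    if i >= len(tagged) or tagged[i][0].lower() not in {"has", "have"}:
        return None
    i += 1
    if i >= len(tagged) or tagged[i][0].lower() not in NEGATIONS:
        return None
    i += 1
    if i >= len(tagged) or tagged[i][0].lower() != "been":
        return None
    i += 1
    adverbs = []
    while i < len(tagged) and tagged[i][1] == "RB":
        adverbs.append(tagged[i][0])
        i += 1
    if i >= len(tagged) or tagged[i][1] != "VBN":
        return None
    return {"type": "FVG", "tense": "PrePer", "voice": "passive",
            "neg": "yes", "adverb": adverbs, "verb": tagged[i][0]}


simple = match_fvg_pre_per_pas_neg(
    [("has", "VBZ"), ("n't", "RB"), ("been", "VBN"), ("eaten", "VBN")])
with_adverb = match_fvg_pre_per_pas_neg(
    [("have", "VBP"), ("not", "RB"), ("been", "VBN"),
     ("completely", "RB"), ("destroyed", "VBN")])
```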

Based on the recognized verb groups, different action types can be detected (e.g., from the
infinitives 'murder', 'kill', 'decapitate', ... the action class 'kill'). After the action type has
been detected, the verb phrase and the other parts of the sentence must be combined. In the ZENON project
we use the semantic frames from the FrameNet project (cf. [FrameNet]) to realize this
combination. Semantic frames are schematic representations of situation types (eating, killing,
spying, classifying, etc.) together with lists of the kinds of participants, objects, and other
conceptual roles that are seen as components of such situations. These semantic arguments are
called the frame elements of the frame. Figure 4 shows an example. The core (i.e., mandatory)
frame elements for the frame 'killing' are CAUSE or KILLER, and VICTIM. In the example
sentence 'John' fills the role KILLER and 'Martha' fills the role VICTIM.


Semantic frame 'killing':
    A KILLER or CAUSE causes the death of the VICTIM.

Core frame elements:
    CAUSE, KILLER, VICTIM

Non-core frame elements:
    DEGREE, DEPICTIVE, INSTRUMENT, MANNER,
    MEANS, PLACE, PURPOSE, REASON, RESULT, TIME

Example sentence:
    [John killer] DROWNED [Martha victim].

Figure 4: Frame 'killing'
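The mapping from infinitives to action types and from action types to frames can be written down as plain data. The sketch below is an illustration, not the ZENON implementation: the class `Frame` and the helper names are hypothetical, and encoding the core elements as groups of alternatives (CAUSE or KILLER, plus VICTIM) is one possible reading of the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    core: tuple      # groups of alternatives; each group needs one filled role
    non_core: tuple


KILLING = Frame(
    name="killing",
    core=(("CAUSE", "KILLER"), ("VICTIM",)),
    non_core=("DEGREE", "DEPICTIVE", "INSTRUMENT", "MANNER", "MEANS",
              "PLACE", "PURPOSE", "REASON", "RESULT", "TIME"),
)

# Infinitive -> action type, using the three verbs named in the text.
ACTION_TYPES = {"murder": "kill", "kill": "kill", "decapitate": "kill"}
# Action type -> semantic frame.
FRAMES = {"kill": KILLING}


def frame_for(infinitive):
    """Return the frame associated with a verb infinitive, or None."""
    return FRAMES.get(ACTION_TYPES.get(infinitive))


def core_filled(frame, elements):
    """Check that every group of core elements has at least one filled role."""
    return all(any(role in elements for role in group) for group in frame.core)


example = {"KILLER": "John", "VICTIM": "Martha"}
```

With this encoding, `core_filled(KILLING, example)` holds for the example sentence, while a structure that only names a victim is rejected as incomplete.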
Associated with each semantic frame are examples with typical syntactic realizations of the
frame elements. These examples and examples from the KFOR reports form the basis for
constructing the transducers that produce the sentence content.

During the processing, the associated semantic frame is inferred from the detected action
type. With the identified semantic frame, the core and non-core frame elements are given.
Recognized named entities, POS tags and expressions from the sentences are used to fill in
the frame elements.

For  example,  after  processing 

John  Mueller  and  four  other  persons  were  killed  in  an  explosion 
incident  in  GOSTIVAR  area.3 

the named entities for the person, the city and the number (see Figure 5) and the verb group
(see Figure 6) were produced. This information, together with miscellaneous language
material from the sentence, is used to produce the content representation of the whole sentence
(see Figure 7). The type of the sentence is 'kill'. For the victims there are two pieces of
information. The value of ':victim' is the identified person. The name of the victim does not
span the whole expression in front of the verb phrase, so the sentence contains more information
about possible victims. To avoid losing any information, the whole expression
is stored in ':victimAll'. The cause of the killing event is stored in ':causeAll'. The
information about where the killing took place is given through ':place' and ':placeAll'.


{
  :type=:person,
  :firstName=John,
  :lastName=Mueller,
}

{
  :type=:city,
  :name=GOSTIVAR,
}

{
  :type=:number,
  :value=4,
}

Figure 5: Recognized named entities (abbreviated)


{
  :type=:vg,
  :infinitive=kill,
  :typeVG=FVG,
  :tense=SimPas,
  :voice=passive,
  :rule=FVGSimPasPas
}

Figure 6: Recognized verb group (abbreviated)


3  The  names  are  not  the  real  ones. 
{
  :type=kill,
  :verb=
  {
    :type=:vg,
    :infinitive=kill,
    :typeVG=FVG,
    :tense=SimPas,
    :voice=passive,
    :rule=FVGSimPasPas
  },
  :victim=
  {
    :type=:person,
    :firstName=John,
    :lastName=Mueller,
    :start=534,
    :end=546,
    :rule=PersonName2
  },
  :victimAll=John Mueller and four other persons,
  :causeAll=an explosion incident,
  :place=
  {
    :type=:city,
    :name=GOSTIVAR,
    :rule=City
  },
  :placeAll=GOSTIVAR area,
  :sentenceContent=John Mueller and four other persons were killed in an
  explosion incident in GOSTIVAR area.,
  :start=534,
  :end=624,
  :rule=killAction2
}

Figure 7: Formal representation of sentence content
Information Extraction Presentation System

The natural language processing module of the ZENON prototype creates a formal
representation of the content for each sentence in each KFOR report. This representation
contains pieces of information about activities, events, entities, times and places. These pieces
are then put together according to specific analysis requirements (e.g., all information about a
specific person). The result of this recombination is a graphically navigable
Entity-Action-Network. The intelligence analyst can use this network to access the important
information in the set of reports more quickly.

In the ZENON prototype the functionality described above is realized by the Information
Extraction Presentation System (IEPS, cf. [Casals, 2005]). IEPS is a graphical software
tool (see Figure 8) for visualizing information typically extracted from free-form texts by a
natural language processing system. Additionally, it offers a framework to organize all the
files employed during the processing in user-defined scenarios and to activate the IE
process. IEPS represents extracted information by means of interactive graphs. The results of
the IE system can be filtered in different ways, either by hiding unimportant information or
by combining results from different reports and sentences. This is done using XSLT. The
visual interface is based on TouchGraph (cf. [TouchGraph]).
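The recombination of sentence representations into an Entity-Action-Network can be sketched as a simple index from entities to the actions they take part in. This Python sketch only illustrates the idea; IEPS itself performs the selection and combination with XSLT. The first structure follows Figure 7, while the second ('meet' with an ':agent' role) is a hypothetical addition to show how information from different sentences is merged.

```python
from collections import defaultdict


def entity_key(entity):
    """Return a display name for a person or city structure."""
    if entity["type"] == "person":
        return f'{entity["firstName"]} {entity["lastName"]}'
    return entity["name"]


def build_network(sentence_structures):
    """Map each entity to a list of (action type, role, sentence start)."""
    network = defaultdict(list)
    for fs in sentence_structures:
        for role, value in fs.items():
            # Only nested person and city structures count as entities here.
            if isinstance(value, dict) and value.get("type") in ("person", "city"):
                network[entity_key(value)].append((fs["type"], role, fs["start"]))
    return dict(network)


mueller = {"type": "person", "firstName": "John", "lastName": "Mueller"}
gostivar = {"type": "city", "name": "GOSTIVAR"}

kill = {"type": "kill", "victim": mueller, "place": gostivar, "start": 534}
meet = {"type": "meet", "agent": mueller, "start": 700}  # hypothetical sentence

network = build_network([kill, meet])
```

Looking up a person in `network` then returns all actions in which that person is involved, which corresponds to a network view centered on one entity.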


Figure 8: Information Extraction Presentation System (IEPS)
4. The KFOR Corpus


A total of 4,498 military reports (mostly in English) from the KFOR deployment of the German
Federal Armed Forces were used for the realization of the ZENON prototype. Of these reports, 800
were manually annotated and form the KFOR Corpus4. This corpus is a specialized micro text
corpus (cf. [McEnery, 2001, p. 191]). The corpus covers 886,000 tokens and contains
annotations in different layers (cf. [Hecking, 2005b]). The following layers are available:

• Original markups: In this layer those parts of the message are annotated which are already
formatted (e.g., addressee, topic, source).

• Token: This layer contains the annotations which are supplied by the tokenizer and the
POS tagger.

• Gazetteer: In this layer those expressions are annotated which were identified through lists of
names (e.g., first names, city names).

• Sentence: These annotations mark sentences and the begin and end markers of comments.

•  Named  entities:  City,  Company,  Coordinates,  Country,  CountryAdj,  Currency,  Date, 
GeneralOrg,  MilitaryOrg,  Number,  Percent,  Person,  PoliticalOrg,  Province,  Region, 
River,  Time  and  Title. 

•  Verb  Group:  The  verbal  phrases  are  annotated. 

During the creation of the corpus, a first version of the annotations was produced
automatically. These annotations were then checked manually and corrected. For both
working steps GATE was used. The corpus is represented in

•  the  GATE-specific  format, 

•  the  GATE-specific  format  in  XML,  and 

•  the  ANC  (American  National  Corpus)  stand-off  annotation  format. 

The KFOR corpus is used for the following purposes:

1.  It  represents  the  basis  for  the  construction  of  the  IE  component.  The  lexicon  and  the 
transducers  are  optimized  towards  the  corpus. 

2.  The  performance  of  the  ZENON  IE  can  be  quantitatively  evaluated  relative  to  the 
KFOR  corpus. 

3.  The  KFOR  corpus  can  be  used  for  other  research  objectives  (e.g.,  complexity  of 
nominal  phrases,  word  sense  disambiguation,  machine  learning  of  grammatical 
structures,  etc.). 
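A quantitative evaluation against the corpus could, for instance, compare automatically produced annotations with the manually checked ones. The paper does not state which measures are used; precision and recall are assumed here as a common choice. Annotations are modeled as (start, end, type) triples, and all offsets except the person span 534-546 from Figure 7 are invented for illustration.

```python
def precision_recall(gold, predicted):
    """Compare two collections of (start, end, type) annotations.

    Precision is the share of predicted annotations that are in the gold
    standard; recall is the share of gold annotations that were predicted.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Gold standard: manually checked corpus annotations (offsets partly invented).
gold = {(534, 546, "Person"), (612, 620, "City")}
# System output: one exact match and one annotation with wrong boundaries.
predicted = {(534, 546, "Person"), (600, 605, "City")}

precision, recall = precision_recall(gold, predicted)
```

This sketch counts only exact matches of span and type; a more lenient scheme that credits partially overlapping spans would be an equally plausible design.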


4  Since  the  KFOR  corpus  is  classified,  it  is  not  freely  available. 
5. Conclusion


In this paper, the ZENON project was presented. In this project an information extraction
approach is used for the (partial) content analysis of English HUMINT reports from the
KFOR deployment of the Bundeswehr. First, a short introduction to the information
extraction approach was given. Then, the ZENON project was described in detail. The KFOR
text corpus was described as well.

A first version of the prototype has been constructed. It is able to process sentences of action type
'kill'. This will be extended to other action types. On the natural language processing side,
the use of WordNet will extend the processing capabilities.
■ General problem: There are a lot of texts
(military reports, emails, web pages, scientific reports,
documents, ...) which can't be evaluated due to missing
specialists.

■ Which technical possibilities exist for automating the content
extraction?

Practical approach: Information Extraction (IE)


FGAN
Forschungsinstitut für Kommunikation, Informationsverarbeitung und Ergonomie (FKIE)

Dr. M. Hecking


■ Specific problem: content extraction of HUMINT reports

■ ZENON project: The overall objective is to realize an
experimental system for (partial) content extraction of
HUMINT reports from the KFOR deployment of the
Bundeswehr and to realize a possibility to evaluate the formal
representation of the content.

■ For the realization of the IE module approx. 4000 English
HUMINT reports are available.
■ For the realization the toolbox GATE is used.

■ The ZENON prototype was integrated into the "CliCC" system.

■ "CliCC" is a German contribution to CWID 2006.

■ For the evaluation of the ZENON IE the KFOR text corpus was
developed.
■ Information extraction (IE) is the task of identifying,
collecting and normalizing information from natural language
text.

■ Relevant information about the Who, What, When, etc. is
looked for.

■ The information of interest is described through domain-specific
lexicon rules and patterns called templates.

■ During the IE task these templates are filled with the
collected information.

■ The templates are domain- and task-specific, i.e., for each new
task and domain they must be newly created.
[Figure: Example of information extraction. The report fragment "... in TUZLA: 10.20 a.m. - 8 with
attached ZIS-3 and 1 with attached T 12 - at road crossing (CQ 072368) south of MILESKIJ (CQ 0737)
to the north." is analyzed into a report structure with the elements addressee, message, move,
objects, vehicles, area and location, which answers who, when and where.]
■ Technical basis of the information extraction:

♦ extensive linguistic knowledge

    general and application-specific lexica

    general and/or application-specific phrase grammars

    general (and/or application-specific) clause and
    sentence grammars

♦ cascaded transducers, i.e., finite-state automata that
read from the input and write to the output

♦ only application-relevant parts of the texts are analyzed
through transducers (shallow parsing techniques)
The experimental ZENON project:

■ The Information Extraction (IE) technology is used for the
content extraction.

■ The information about the actions and named entities is
identified in each sentence and the content of the
sentences is formally represented in typed feature
structures.

■ These structures can be combined and
presented in a graphically navigable
Entity-Action-Network.
[Figure: The ZENON processing chain. GATE: tokenizing, sentence splitting, POS tagging, gazetteer,
NE recognition, morphological analysis; detect verb phrases, extract action types, extract sentence
content. IEPS: select extracted information, combine extracted information.]
GATE:

■ "is one of the most widely used human language processing
systems in the world."

■ "comprises an architecture, framework (or SDK) and
graphical development environment ..."

■ "... has been under construction in Sheffield since 1995."

■ "The system has been used for many language processing
projects; in particular for Information Extraction in many
languages."

■ "GATE is funded by the Engineering and Physical Sciences
Research Council (EPSRC), the EU and commercial users."

■ http://gate.ac.uk/
■ Chunk parsing of named entities (NE): City, Company,
Coordinates, Country, CountryAdj, Currency, Date,
GeneralOrg, MilitaryOrg, Number, Percent, Person,
PoliticalOrg, Province, Region, River, Time and Title

■ Example:

Rule: PersonName1
(
  ( ({PersonTitle}):title ({Token.string == "."})? )
  ( ({Lookup.majorType == person_first}):firstName )?
  ({Token.category == NNP, Token.orth == allCaps}):lastName
  ({Token.string == "-"} {Token.category == NNP})?
):person
-->
{...}
Determine action types:

■ extraction of verb phrases (modal verb phrases, participles,
special composed verb expressions)

■ mapping from recognized verb groups to action types (e.g.,
from the infinitive of 'murder', 'kill', 'decapitate', ... to action
type 'kill').
Sentence analysis:

■ The basic structures for the semantic sentence analysis are
given by the FrameNet project.

■ Semantic roles are specified for the lexical units (frames, verbs).

■ Example: Frame Killing

♦ Def.: A KILLER or CAUSE causes the death of the VICTIM.

♦ Roles: CAUSE, KILLER, VICTIM, DEGREE, INSTRUMENT, ...

♦ Sentence: [John Mueller and four other persons victim] were
killed in [an explosion incident cause] in [GOSTIVAR area place].

♦ Formal representation ...
{
:type=kill,
:verb={:type=:vg, :infinitive=kill, :typeVG=FVG, :tense=SimPas, ...},
:victim={:type=:person, :firstName=John, :lastName=Mueller, ...},
:victimAll=John Mueller and four other persons,
:causeAll=an explosion incident,
:place={:type=:city, :name=GOSTIVAR, ...},
:placeAll=GOSTIVAR area,
:sentenceContent=John Mueller and four other persons were killed ...,
:start=534,
:end=624,
:rule=killAction2
}
Information Extraction Presentation System (IEPS):

■ Graphical presentation of the extracted information (typed
feature structures)

■ Scenario: Name, sets of input and output files (XML coded
feature structures), filter.

■ Filter: Description of the transformation with XSLT

[Figure: IEPS screenshot with the labels "input files", "graphical output" and "formal output".]
[Figure: IEPS example. The textual input "... move ... at street crossing (32UPD0290080100) ...
18 vehicles" is shown next to its feature-structure tree, which contains the features TYPE (REPORT),
MEDIUM (LETTER), ADDRESSEE (TYPE: UNIT), ABBREVIATION (8JPzMrs332-ZugB), LOCATED and MESSAGE.
The message has TYPE MOVE, an AREA (TYPE: CROSSING; LOCATION with COORDINATES 32UPD0290080100,
TYPE POINT and QUALIFIER AT) and a THEME whose OBJECTS are 18 items of TYPE VEHICLE.]
■ KFOR corpus: Approx. 4,000 military HUMINT reports from
the KFOR deployment of the Bundeswehr are the basis for
the realization of the KFOR corpus.

■ specialized micro text corpus

■ 886,000 tokens

■ The corpus is classified.
■ Annotation layers:

> Original markups: parts of the message which are
already formatted (e.g., addressee, topic, source)

> Token: annotations from the tokenizer and the POS
tagger

> Gazetteer: expressions identified through lists of names

> Sentence: sentences, begin and end markers of
comments

> Named entities

> Verb phrases
■ During the creation of the corpus a first version of the
annotations was produced automatically.

■ These annotations were then manually checked and
corrected.

■ For both working steps GATE was used.

■ The corpus is represented in

♦ the GATE-specific format,

♦ the GATE-specific format in XML,

♦ a stand-off annotation format.
For the following purposes the military messages and the
corpus can be used:

♦ They represent the basis for the construction of the IE
component. The lexicon and the transducers were
optimized towards the corpus.

♦ The performance of the IE of the ZENON prototype can be
evaluated quantitatively relative to the corpus.

♦ The corpus can be used for other research objectives
(e.g., complexity of nominal phrases, word sense
disambiguation, machine learning of grammatical
structures, etc.).